Matrix sampling of items (that is, division of a set of items into different versions of a test 
form) is used by several large-scale testing programs. Like other test designs, matrixed 
designs have advantages and disadvantages. For example, testing time per student is 
less than if each student received all the items, but the comparability of student scores 
may decrease. Also, curriculum coverage is maintained, but reporting of scores 
becomes more complex. In this Digest, nine categories of costs associated with matrix 
sampling are discussed: development costs, materials costs, administration costs, 
educational costs, scoring costs, reliability costs, comparability costs, validity costs, and 
reporting costs. 

Development costs include the cost of writing items, subjecting them to sensitivity and 
technical reviews, pilot and field testing them, and analyzing the pilot and field test 
results. In general, developing more items requires more staff time and more 
participation by schools. In small jurisdictions, developing large numbers of items may 
be particularly burdensome for two reasons: the cost of developing additional items 
raises the per-student cost of testing more quickly when fewer students take the test, 
and the numbers of schools and students available to pilot and field test new items are 
limited. 
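
The per-student effect described above is simple arithmetic, and a minimal sketch may make it concrete. The function name and every figure below (100 additional items, 1,000 currency units per item, 5,000 versus 500,000 students) are hypothetical and chosen only for illustration; they are not taken from any testing program.

```python
def per_student_development_cost(n_items, cost_per_item, n_students):
    """Development cost spread evenly over all tested students."""
    return n_items * cost_per_item / n_students


# Hypothetical figures: 100 additional items at 1,000 currency units each.
extra_items, cost_per_item = 100, 1_000

small = per_student_development_cost(extra_items, cost_per_item, n_students=5_000)
large = per_student_development_cost(extra_items, cost_per_item, n_students=500_000)

print(f"small jurisdiction: {small:.2f} per student")
print(f"large jurisdiction: {large:.2f} per student")
```

Under these assumed figures, the same 100 items add 20 units per student in the small jurisdiction and 0.20 in the large one.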

Materials costs include the expense of printing the test booklets and of shipping them to 
schools. Longer tests are more expensive to print and, because the resulting booklets 
are larger and heavier, shipping costs more. In addition, although new computerized 
printing technologies are helping to decrease the costs of printing multiple versions of 
the test booklets, the complexity of preparing multiple versions for printing still means 
that they are more expensive to produce than a single version of a test. 

If a test is to be administered by computer, the materials costs are very different, of 
course. Instead of printing and shipping test booklets, test developers must procure or 
arrange for the use of computers and must set up a computer program to deliver the 
test. Depending on the computerized administration approach, the costs of 
administering different versions of the test may or may not be greater than administering 
a single version. 

Another possibility is to print a single version of a test booklet and instruct different 
students to respond to different sections of the booklet. This avoids the costs of printing 
multiple versions of the test booklet but results in a longer booklet and, perhaps, some 
confusion for students. 

In addition to test booklets, other materials must be printed and mailed to the schools, 
including instructions for handling the test booklets and administering the test and 
explanations of why the test is being given and how the results of the test should be 
interpreted. Parents may receive materials, either directly or through the schools, 
explaining the purpose of the test and reporting their child's results. Some of these 
materials may be distributed via the Internet, but printed materials are still required for 
parents without Internet access. 

Administration costs include the time teachers and other school personnel must devote 
to preparing for the test administration (reviewing the procedures and sorting the 
materials), administering the test, and returning the materials to the scoring site. 
Depending on the complexity of the test administration, the time required can vary 
widely. If the test is short, requires only paper and pencil, and has only a single version, 
the time to prepare for the administration may be minimal. However, if the test is longer, 
if students must perform tasks in addition to writing responses, and/or if different test 
versions must be administered in a complex pattern, then the time required to prepare 
for the test administration can be quite substantial. Multiple test versions, by 
themselves, need not increase administration costs, however, if the booklet distribution 
pattern is very simple. 
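
As an illustration of what a very simple distribution pattern might look like, the sketch below hands out forms in a fixed repeating order down the class list (an approach often called spiraling). The pattern, the function name, the student identifiers, and the form labels are all assumptions made for the example; the text above does not prescribe any particular pattern.

```python
from collections import Counter
from itertools import cycle


def spiral_forms(student_ids, form_labels):
    """Hand out forms in a repeating A, B, C, ... order down the class list."""
    return dict(zip(student_ids, cycle(form_labels)))


# Hypothetical class of ten students and three test versions.
students = [f"S{n:02d}" for n in range(1, 11)]
assignment = spiral_forms(students, ["A", "B", "C"])

# Each form is used about equally often (counts differ by at most one).
counts = Counter(assignment.values())
print(assignment)
print(counts)
```

A pattern like this requires no sorting beyond stacking the booklets in order, which is one way the preparation time could stay low even with several versions.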

Educational costs include the time that is taken from other educational activities for test 
preparation and administration. They also include the impact of knowing that the test will 
be administered on the way teachers cover the curriculum. For example, teachers may 
increase the amount of time they spend teaching parts of the curriculum they expect to 
be on the test and decrease the amount of time spent on other concepts. The test may 
also impact the way that a district or school allocates resources. For example, 
resources may be directed disproportionately to the grade levels that will have to take 
the test, if the district or school believes that doing so might improve test results. 

Scoring costs include the costs of scanning and processing "bubble sheets" for 
multiple-choice items, preparing test scoring guides, recruiting and training judges to 
mark student responses to constructed-response items, completing the scoring, and 
checking and processing the results. Scanning multiple-choice responses is relatively 
inexpensive. However, recruiting and training judges to mark students' test responses is 
time-consuming and expensive. As the amount of student work to be marked increases, 
the amount of scoring, of course, increases. Even if the length of the test remains 
constant, if there are multiple versions of the test, then the costs of preparing scoring 
guides, of training judges, and of processing results will increase because the number 
of unique items across test versions will increase. 

The financial and logistical costs of scoring more student work may be partly offset, 
however, if one purpose of the testing program is to provide scoring-related professional 
experience and employment for teachers and/or others. If the amount of scoring 
increases, the amount of employment available will increase. 

Reliability costs. Reliability refers to how accurate and how consistent scores are. Some 
test designs lead to more accurate and consistent scores than other designs. The type 
of accuracy and consistency of interest will depend on the type of score that the test 
must yield. Is the test intended to produce a score on a scale of 1 to 100? If so, it may 
be important to know the confidence interval or standard error around the student's 
score. Is the test's purpose to place students in one of several performance levels or 
yield a pass-fail decision? In either case, the accuracy of the decision would be 
important. 
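
For a scale score, the confidence interval mentioned above can be sketched with the classical test theory formula for the standard error of measurement, SEM = SD * sqrt(1 - reliability). The example below is an illustration only: it assumes normally distributed measurement error, and the score (72), standard deviation (10), and reliability (0.84) are hypothetical values.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    return score_sd * math.sqrt(1.0 - reliability)


def confidence_interval(score, sem, z=1.96):
    """Approximate interval around an observed score (z=1.96 for about 95%)."""
    half_width = z * sem
    return score - half_width, score + half_width


# Hypothetical values on a 1-to-100 scale.
sem = standard_error_of_measurement(score_sd=10.0, reliability=0.84)
low, high = confidence_interval(score=72.0, sem=sem)
print(f"SEM = {sem:.2f}, interval = ({low:.2f}, {high:.2f})")
```

With these assumed inputs the SEM is about 4 points, so the interval runs from roughly 64 to 80. A shorter test with lower reliability would widen it.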

Different levels of scores must also be considered. Many testing programs are required 
by law to produce both student scores and school- or district-level scores. In addition, 
they may provide summary results for the entire jurisdiction. Paradoxically, some test 
designs can increase the reliability of one score level while decreasing reliability at 
another level. In particular, if the number of items an individual student answers is small, 
the reliability of student scores will be low. However, if multiple forms of the test are 
administered, the number of items contributing to the school- or district-level score may 
be large. As Shoemaker (1971) explains, a classical test theory analysis of the resulting 
data would yield a mean test score for each group of students who happened to take 
the same items, and the mean school test score would be computed as a weighted 
composite of the subgroup scores. The standard error of the school mean test score 
based on the matrixed test would be smaller than the standard error from a test of the 
same length, but in which all student scores were based on the same items. In an item 
response theory (IRT) analysis of the same data, however, administering different items 
to different students would not necessarily improve score reliability for students, 
schools, and districts. To do so without increasing the number of items per student, the 
test would have to become "adaptive"; that is, it would avoid administering easy items to 
those students who have better mastery of the material and are almost certain to get 
those items right, and avoid administering hard items to students who are almost certain 
to get them wrong. 
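
The classical test theory comparison can be illustrated with a simplified variance-components sketch. It is not Shoemaker's own derivation. It assumes that students and items are both random samples, that the design is balanced (equal numbers of students per form, equal numbers of items per form, no items shared across forms), and that all responses are weighted equally. The function name and all variance values are hypothetical.

```python
import math


def se_school_mean(n_students, items_per_student, n_forms,
                   var_person, var_item, var_residual):
    """Standard error of a school mean under a simple random-effects model.

    Each student answers `items_per_student` items; the school as a whole
    sees n_forms * items_per_student distinct items.
    """
    if n_forms < 1 or n_students % n_forms != 0:
        raise ValueError("students must divide evenly among forms")
    total_items = n_forms * items_per_student
    total_responses = n_students * items_per_student
    variance = (var_person / n_students          # sampling of students
                + var_item / total_items         # sampling of items
                + var_residual / total_responses)  # student-by-item noise
    return math.sqrt(variance)


# Hypothetical variance components on a proportion-correct scale.
components = dict(var_person=0.03, var_item=0.02, var_residual=0.20)

single_form = se_school_mean(120, 10, n_forms=1, **components)
matrixed = se_school_mean(120, 10, n_forms=4, **components)
print(f"single form: {single_form:.4f}, four forms: {matrixed:.4f}")
```

In this sketch only the item-sampling term shrinks when forms are added, so the school-level standard error falls while each student still answers ten items and the student-level reliability is unchanged.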

Comparability costs. It is usually assumed that the scores of different students taking a 
test can be compared. Comparability is improved by uniform administration conditions 
and equivalent marking. It also can depend on the particular items that students receive. 
If all students receive the same items, then their scores are easier to compare than if 
they receive different items. The comparability of aggregate scores, such as school- or 
district-level results, is also important to consider. 

The approach chosen to analyze the test results makes a difference. If the items are 
calibrated or equated onto a single scale using item response theory, whether a student 
answered the same or different items should have little effect on comparability. 

However, if classical test theory is used, then the particular items may affect 
comparability. IRT models require at least several hundred responses per item. The 
comparability of performance assessment results, whether reported at the student level 
or in aggregated form, has been addressed by a number of authors (e.g., Bock & 
Mislevy, 1987; Brennan & Johnson, 1995; Cronbach, et al., 1995; Fitzpatrick, Lee, & 
Gao, 2001; Haertel & Linn, 1996; Mislevy, et al., 1992). Haertel and Linn (1996), for 
example, write: 

"Consider the case of a state-level testing program that administers different sets of 
items to different students in order to improve school-level achievement estimates, but 
which also produces individual-level scores. Unless students' scores are based solely 
on items administered to all of them in common, some degree of comparability must be 
assumed across the items given to different students. (There is a dilemma here. The 
more comparable the matrix-sampled items are, the less matrix sampling improves 
content coverage.) (p. 64)." 

In other words, the types of items that will most improve the meaningfulness of 
school-level results may well decrease the comparability of student-level results. 
Computing statistics to measure the comparability of students' scores can be quite 
complex. Cronbach, et al. (1995; see also Brennan & Johnson, 1995, for a similar 
discussion) propose an approach to examining the standard errors of results at the 
student- and school-level, using generalizability analysis. As they point out, the 
comparison of scores for individual students who do not take the same test form 
requires the computation, not just of the standard error of the scores, but the standard 
error of the difference between the scores, which is likely to be considerably larger. 
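
To see why the standard error of a difference exceeds the standard error of either score, consider the simplest case, in which the two students' measurement errors are independent. That independence is an assumption of this sketch, as are the function name and the example values; it is not the generalizability analysis that Cronbach et al. describe. In this case the error variances add.

```python
import math


def se_of_difference(se_a, se_b):
    """Standard error of (score_a - score_b) when the errors are independent."""
    return math.hypot(se_a, se_b)  # sqrt(se_a**2 + se_b**2)


# Hypothetical: two students on different forms, each score with SE of 3 points.
se_diff = se_of_difference(3.0, 3.0)
print(f"SE of each score: 3.00, SE of the difference: {se_diff:.2f}")
```

With equal standard errors, the standard error of the difference is larger by a factor of the square root of 2, or about 41 percent.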

Validity costs. Validity refers to the extent to which a test is measuring what it is 
intended to measure. Different test designs may be more or less valid for different uses 
and interpretations. A particular concern for the validity of a test intended to measure 
mastery of a school curriculum is how well the test represents the curriculum. If the test 
includes a large number of items sampled from across the curriculum, then there is a 
better chance that the test reflects mastery of the curriculum than if the test includes 
very few items and so omits large sections of the curriculum. 

The degree to which a test measures the intended construct can also be affected by 
how easily students are able to demonstrate their knowledge on the test. For example, 
confusing test instructions may interfere with students' ability to demonstrate their 
knowledge. Fatigue may also impact student scores if the test is very long, interfering 
with how well the test measures student mastery. Other sources of bias include a test's 
reading level or the inclusion of extraneous concepts that may be less familiar to some 
students than to others. 

Reporting costs. A more complex test design may require more explanatory materials 
and more communication with educators, parents, and the media. This is especially true 
if the complex design supports certain scores at some score levels (e.g., at the school- 
or district-level) and not at others. 

The nine categories of costs will vary in importance depending on the testing program. 
Testing directors and their staffs must examine relevant costs in light of their 
mandate(s), the content of the tests, the financial resources available, and the 
acceptability of the inevitable compromises. 